Briefings in Bioinformatics

Oxford University Press (OUP)

All preprints, ranked by how well they match Briefings in Bioinformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
A Simple Mathematical Model for Estimating the Inflection Points of COVID-19 Outbreaks

Ma, Z.

2020-03-27 health informatics 10.1101/2020.03.25.20043893
Top 0.1%
281× avg

Background: Exponential-like infection growth leading to a peak (which could be an inflection point or turning point) is usually the hallmark of infectious disease outbreaks, including those caused by coronaviruses. To predict the inflection points, i.e., the inflection time (Tmax) and maximal infection number (Imax) of the novel coronavirus (COVID-19), we adopted a trial-and-error strategy and explored a series of approaches, from simple logistic modeling (which has an asymptotic line) to sophisticated tipping point detection techniques for detecting phase transitions, but failed to obtain satisfactory results. Method: Inspired by its success in the diversity-time relationship (DTR), we apply the PLEC (power law with exponential cutoff) model to detect the inflection points of COVID-19 outbreaks. The model was previously used to extend the classic species-time relationship (STR) to a general DTR (Ma 2018). It has three parameters: a power law scaling parameter w, a taper-off parameter d that ultimately overwhelms the virtually exponential growth, and a parameter c related to initial infections. Two "secondary" parameters are computed from these three: one, originally used for estimating the potential or dark biodiversity, is proposed to estimate the maximal infection number (Imax), and the other is proposed to determine the corresponding inflection time point (Tmax). Results: We successfully estimated the inflection points [Imax, Tmax] for most provinces (approximately 85%) in China, with error rates <5% in both Imax and Tmax. We also discuss the constraints and limitations of the proposed approach: it (i) is sensitive to disruptive jumps, (ii) requires sufficiently long datasets, and (iii) is limited to unimodal outbreaks.
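
The abstract does not spell out the PLEC formula. As a minimal sketch, assuming the form I(T) = c · T^w · exp(d · T) with w > 0 and d < 0 (an assumption, not quoted from the preprint), the peak time and peak size follow from setting the derivative to zero, which gives T = -w/d. The function names and the parameter values in the usage note are hypothetical.

```python
import math


def plec(t, c, w, d):
    """Assumed PLEC curve: I(t) = c * t**w * exp(d * t)."""
    return c * t**w * math.exp(d * t)


def peak(c, w, d):
    """Closed-form maximum of the assumed PLEC curve.

    dI/dt = I(t) * (w / t + d) = 0  gives  t_max = -w / d,
    which is a finite positive time only when w > 0 and d < 0.
    """
    if w <= 0 or d >= 0:
        raise ValueError("a finite peak needs w > 0 and d < 0")
    t_max = -w / d
    i_max = c * t_max**w * math.exp(-w)
    return t_max, i_max
```

For example, `peak(10.0, 2.0, -0.1)` places the peak at T = -w/d = 20 time units. Fitting c, w, and d to real case counts would need a nonlinear least-squares routine, which is omitted here.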

2
Explainable drug side effect prediction via biologically informed graph neural network

Huang, T.; Lin, K.-H.; Vieira, R. M.; Soares, J. C.; Jiang, X.; Kim, Y.

2023-06-05 pharmacology and therapeutics 10.1101/2023.05.26.23290615
Top 0.1%
269× avg

Early detection of potential side effects (SEs) is a critical and challenging task for drug discovery and patient care. In vitro or in vivo approaches to detecting potential SEs do not scale to the many drug candidates at the preclinical stage. Recent advances in explainable machine learning may facilitate detecting potential SEs of new drugs before market release and elucidating the critical mechanisms of biological action. Here, we leverage multi-modal interactions among molecules to develop a biologically informed graph-based SE prediction model, called HHAN-DSI. HHAN-DSI predicted frequent and even uncommon SEs of unseen drugs with accuracy higher than or comparable to that of benchmark methods. When applied to the central nervous system, the organ system with the largest number of SEs, the model revealed previously unknown but probable SEs of diverse psychiatric medications, together with the potential mechanisms of action through a network of genes, biological functions, drugs, and SEs.

3
DeepDrug: An Expert-led Domain-specific AI-Driven Drug-Repurposing Mechanism for Selecting the Lead Combination of Drugs for Alzheimer's Disease

Li, V. O. K.; Han, Y.; Kaistha, T.; Zhang, Q.; Downey, J.; Gozes, I.; Lam, J. C. K.

2024-07-07 pharmacology and therapeutics 10.1101/2024.07.06.24309990
Top 0.1%
268× avg

Alzheimer's Disease (AD) significantly undermines human dignity and quality of life. While newly approved amyloid immunotherapy has been reported, effective AD drugs remain to be identified. Here, we propose a novel AI-driven drug-repurposing method, DeepDrug, to identify a lead combination of approved drugs to treat AD patients. DeepDrug advances drug-repurposing methodology in four aspects. Firstly, it incorporates expert knowledge to extend candidate targets to include long genes, immunological and aging pathways, and somatic mutation markers that are associated with AD. Secondly, it incorporates a signed directed heterogeneous biomedical graph encompassing a rich set of nodes and edges, and node/edge weighting to capture crucial pathways associated with AD. Thirdly, it encodes the weighted biomedical graph through a Graph Neural Network into a new embedding space to capture the granular relationships across different nodes. Fourthly, it systematically selects high-order drug combinations via diminishing return-based thresholds. A five-drug lead combination, consisting of Tofacitinib, Niraparib, Baricitinib, Empagliflozin, and Doxercalciferol, has been selected from the top drug candidates based on DeepDrug scores to achieve the maximum synergistic effect. These five drugs target neuroinflammation, mitochondrial dysfunction, and glucose metabolism, which are all related to AD pathology. DeepDrug offers a novel AI-and-big-data, expert-guided mechanism for new drug combination discovery and drug repurposing across AD and other neurodegenerative diseases, with immediate clinical applications.

4
Estimating the selection pressure of tumor growth on tumor tissue microbiomes

Li, L.; Ma, Z.

2024-03-18 oncology 10.1101/2024.03.17.24304406
Top 0.1%
250× avg

Background: The relationships between a tumor and its microbiome are still puzzling, with possible bidirectional interactions. On the one hand, tumor microbiomes may suppress or stimulate tumor growth; on the other hand, tumor growth may exert selection pressure on its microbiomes. There is no consensus on the mode and/or extent of these bidirectional interactions. The objective of this study is to estimate the selection pressure from primary tumors on tumor microbiomes by comparing it with the selection pressure from solid normal tissues on their corresponding tissue microbiomes across 20+ cancer types. Methods: We apply the Sloan near-neutral theory and big datasets of tumor tissue microbiomes from the TCGA (The Cancer Genome Atlas) database to achieve this objective. The near-neutral theory model can determine the proportions of above-neutral, neutral, and below-neutral species in microbial communities, corresponding to positive, neutral, and negative selection pressures from host tissues. By comparing these proportions between primary tumors and solid normal tissues, we can infer the selection pressure of tumor growth on tissue microbiomes. Results: We find that approximately 65% of species in solid normal tissue microbiomes are neutral, whereas the proportion is only 40% in primary tumor microbiomes. In contrast, the proportion of positively selected species exceeds 60% in primary tumor microbiomes. Furthermore, simulations with the neutral theory model reveal that the most abundant species are mostly neutral, while non-neutral species are in the long tail of the species abundance distributions. Conclusions: Tumor growth exerts strong positive selection on resident microbiomes, driving the abundances of certain species above the levels expected under the neutral process. Nevertheless, neutral species are still among the most abundant species, suggesting the need to pay close attention to low-abundance or rare species because they are likely to play a critical role in oncogenesis.

5
pyTCR: a comprehensive and scalable platform for TCR-Seq data analysis to facilitate reproducibility and rigor of immunogenomics research

Peng, K.; Moore, J.; Brito, J.; Kao, G.; Burkhardt, A. M.; Alachkar, H.; Mangul, S.

2022-05-27 genetic and genomic medicine 10.1101/2022.05.26.22275650
Top 0.1%
192× avg

T cell receptor (TCR) studies have grown substantially with advances in T cell receptor repertoire sequencing (TCR-Seq) techniques. Analyzing TCR-Seq data requires the computational skills to run TCR repertoire analysis tools. However, biomedical researchers with limited computational backgrounds face numerous obstacles to properly and efficiently using bioinformatics tools for analyzing TCR-Seq data. Here we report pyTCR, a computational notebook-based platform for comprehensive and scalable TCR-Seq data analysis. Computational notebooks, which combine code, calculations, and visualization, provide users with a high level of flexibility and transparency for the analysis. Additionally, computational notebooks have been shown to be user-friendly and suitable for researchers with limited computational skills. Our platform has a rich set of functionalities including various TCR metrics, statistical analysis, and customizable visualizations. Applying pyTCR to large and diverse TCR-Seq datasets will enable effective and flexible analysis of large-scale TCR-Seq data, and eventually facilitate new discoveries.

6
Detecting microbiome species unique or enriched in 20+ cancer types and building cancer microbiome heterogeneity networks

Ma, Z.; Li, L.; Mei, J.

2024-03-24 oncology 10.1101/2024.03.23.24304768
Top 0.1%
187× avg

It is postulated that the tumor tissue microbiome is one of the enabling characteristics that either promote or suppress the acquisition of certain hallmarks (functional traits) of cancer by cancer cells and tumors, which highlights its critical importance to carcinogenesis, cancer progression, and therapy responses. However, characterizing tumor microbiomes is extremely challenging because of their low biomass and the severe difficulty of controlling laboratory-borne contaminants, which is further aggravated by the lack of comprehensively effective computational approaches to identify unique or enriched microbial species associated with cancers. Here we take advantage of two recent computational advances: one by Poore et al. (2020, Nature), which computationally generated the microbiome datasets of 33 cancer types [from 10481 patients, including primary tumor (PT), solid normal tissue (NT), and blood samples] from whole-genome and whole-transcriptome data deposited in "The Cancer Genome Atlas" (TCGA), and another termed the "specificity diversity framework" (SDF), developed recently by Ma (2023). By reanalyzing the datasets of Poore et al. with the SDF, further augmented with complex network analysis, we produced statistically rigorous catalogues of microbial species (archaea, bacteria, and viruses), including unique species (USs) and enriched species (ESs) in PT, NT, or blood tissues. We further reconstructed the species specificity network (SSN) and cancer microbiome heterogeneity network (CHN) to identify core/periphery network structures, from which we gain insights into the codependency of microbial species distribution on the landscape of cancer types; this codependency appears to be universal across all cancer types.

7
Network reinforcement driven drug repurposing for COVID-19 by exploiting disease-gene-drug associations

Nam, Y.; Yun, J.-S.; Lee, S. M.; Park, J. W.; Chen, Z.; Lee, B.; Verma, A.; Ning, X.; Shen, L.; Kim, D.

2020-08-14 pharmacology and therapeutics 10.1101/2020.08.11.20173120
Top 0.1%
162× avg

Currently, the number of patients with COVID-19 has significantly increased. Thus, there is an urgent need for developing treatments for COVID-19. Drug repurposing, which is the process of reusing already-approved drugs for new medical conditions, can be a good way to solve this problem quickly and broadly. Many clinical trials for COVID-19 patients using treatments for other diseases have already been in place or will be performed at clinical sites in the near future. Additionally, patients with comorbidities such as diabetes mellitus, obesity, liver cirrhosis, kidney diseases, hypertension, and asthma are at higher risk for severe illness from COVID-19. Thus, the relationship of comorbidity disease with COVID-19 may help to find repurposable drugs. To reduce trial and error in finding treatments for COVID-19, we propose building a network-based drug repurposing framework to prioritize repurposable drugs. First, we utilized knowledge of COVID-19 to construct a disease-gene-drug network (DGDr-Net) representing a COVID-19-centric interactome with components for diseases, genes, and drugs. DGDr-Net consisted of 592 diseases, 26,681 human genes and 2,173 drugs, and medical information for 18 common comorbidities. The DGDr-Net recommended candidate repurposable drugs for COVID-19 through network reinforcement driven scoring algorithms. The scoring algorithms determined the priority of recommendations by utilizing graph-based semi-supervised learning. From the predicted scores, we recommended 30 drugs, including dexamethasone, resveratrol, methotrexate, indomethacin, quercetin, etc., as repurposable drugs for COVID-19, and the results were verified with drugs that have been under clinical trials. The list of drugs via a data-driven computational approach could help reduce trial-and-error in finding treatment for COVID-19.
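
The abstract names graph-based semi-supervised learning but gives no formula. The sketch below illustrates one common variant, iterative label propagation with symmetric normalization; it is a generic illustration, not the scoring algorithm used for DGDr-Net. The toy graph, the node names, and the `alpha` value are hypothetical.

```python
import math


def propagate_scores(adjacency, seeds, alpha=0.8, iterations=200):
    """Iterative label propagation on an undirected weighted graph.

    adjacency: dict of node -> {neighbor: positive edge weight}, symmetric.
    seeds: dict of known-relevant nodes -> initial score (for example 1.0).
    alpha: share of each node's score taken from its neighbors per step.
    """
    nodes = list(adjacency)
    degree = {n: sum(adjacency[n].values()) for n in nodes}
    prior = {n: float(seeds.get(n, 0.0)) for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Symmetric normalization: w_nm / sqrt(deg_n * deg_m)
            spread = sum(
                w / math.sqrt(degree[n] * degree[m]) * scores[m]
                for m, w in adjacency[n].items()
            )
            updated[n] = alpha * spread + (1.0 - alpha) * prior[n]
        scores = updated
    return scores


# Hypothetical toy network: drug_x reaches the disease through GENE_A,
# while drug_y sits in a component with no path to the disease.
toy_graph = {
    "COVID-19": {"GENE_A": 1.0},
    "GENE_A": {"COVID-19": 1.0, "drug_x": 1.0},
    "drug_x": {"GENE_A": 1.0},
    "GENE_B": {"drug_y": 1.0},
    "drug_y": {"GENE_B": 1.0},
}
toy_scores = propagate_scores(toy_graph, seeds={"COVID-19": 1.0})
```

In this toy setup, ranking the drug nodes by `toy_scores` puts `drug_x` ahead of `drug_y`, because only `drug_x` is connected to the seeded disease node.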

8
Blinatumomab Trimer Formation: Insights From A Mechanistic PKPD Model Into The Implications For Switching From Infusion To Subcutaneous Dosing Regimen

Kapitanov, G. I.; Head, S. A.; Flowers, D.; Apgar, J. F.; Grant, J.

2024-03-13 pharmacology and therapeutics 10.1101/2024.03.11.24304117
Top 0.1%
161× avg

Blinatumomab is a bispecific T-cell engager (BiTE) that binds to CD3 on T cells and CD19 on B cells. It has been approved for use in B-cell acute lymphoblastic leukemia (B-ALL) with a regimen that requires continuous infusion (cIV) for four weeks per treatment cycle. It is currently in clinical trials for Non-Hodgkin lymphoma (NHL) with cIV administration. Recently, there have been studies investigating dose-response after subcutaneous (SC) dosing in B-ALL and in NHL to determine whether this more convenient method of delivery would have a similar efficacy/safety profile as continuous infusion. We constructed mechanistic PKPD models of blinatumomab activity in B-ALL and NHL patients, investigating the amount of CD3:blinatumomab:CD19 trimers the drug forms at different dosing administrations and regimens. The modeling and analysis demonstrate that the explored SC doses in B-ALL and NHL achieve similar trimer numbers as the cIV doses in those indications. We further simulated various subcutaneous dosing regimens, and identified conditions where trimer formation dynamics are similar between constant infusion and subcutaneous dosing. Based on the model results, subcutaneous dosing is a viable and convenient strategy for blinatumomab and is projected to result in similar trimer numbers as constant infusion.

9
Bioinformatic Analysis of Defective Viral Genomes in SARS-CoV-2 and Its Impact on Population Infection Characteristics

Xu, Z.; Peng, Q.; Song, J.; Zhang, H.; Wei, D.; Demongeot, J.

2023-10-05 infectious diseases 10.1101/2023.10.05.23296580
Top 0.1%
159× avg

DVGs (Defective Viral Genomes) and SIPs (Semi-Infectious Particles) are commonly present in RNA virus infections. In this study, we analyzed high-throughput sequencing data and found that DVGs or SIPs are also widely present in SARS-CoV-2. Comparison of SARS-CoV-2 with various DNA viruses revealed that the SARS-CoV-2 genome is more susceptible to damage and has greater sequencing sample heterogeneity. Variability analysis at the whole-genome sequencing depth showed a higher coefficient of variation for SARS-CoV-2, and DVG analysis indicated a high proportion of splicing sites, suggesting significant genome heterogeneity and implying that most assembled virus particles package incomplete RNA sequences. We further analyzed the characteristics of different strains in terms of sequencing depth and DVG content and found that, as the virus evolves, the proportion of intact genomes in virus particles increases, which is clearly reflected in third-generation sequencing data, while the proportion of DVGs gradually decreases. Specifically, the proportion of intact genomes in Omicron was greater than that in the Delta and Alpha strains. This can well explain why the Omicron strain is more infectious than the Delta and Alpha strains. We also speculate that this improvement in completeness is due to enhanced virus assembly ability, as the Omicron strain can quickly bind RNA to capsid protein, thereby shortening the time that exposed viral RNA spends in the host environment and greatly reducing its degradation. Finally, using mathematical modeling, we simulated how DVG effects under different environmental factors affect the infection characteristics and evolution of the population. We can explain well why the severity of symptoms is closely related to the amount of invading virus and why the same strain causes huge differences in population infection characteristics under different environmental conditions. Our study provides a new approach for future virus research and vaccine development.

10
The Great Genotyper: A Graph-Based Method for Population Genotyping of Small and Structural Variants

Shokrof, M.; Abuelanin, M.; Brown, C. T.; Mansour, T. A.

2024-07-05 genetic and genomic medicine 10.1101/2024.07.04.24309921
Top 0.1%
144× avg

Long-read sequencing (LRS) enables variant calling of high-quality structural variants (SVs). Genotypers of SVs utilize these precise call sets to increase the recall and precision of genotyping in short-read sequencing (SRS) samples. With the extensive growth in availability of SRS datasets in recent years, we should be able to calculate accurate population allele frequencies of SVs. However, reprocessing hundreds of terabytes of raw SRS data to genotype new variants is impractical for population-scale studies, a computational challenge known as the N+1 problem. Solving this computational bottleneck is necessary to analyze new SVs from the growing number of pangenomes in many species, public genomic databases, and pathogenic variant discovery studies. To address the N+1 problem, we propose The Great Genotyper, a population genotyping workflow. Applied to a human dataset, the workflow begins by preprocessing 4.2K short-read samples, totaling 183TB of raw data, to create an 867GB Counting Colored De Bruijn Graph (CCDG). The Great Genotyper uses this CCDG to genotype a list of phased or unphased variants, leveraging the CCDG population information to increase both precision and recall. The Great Genotyper offers the same accuracy as state-of-the-art genotypers with the addition of unprecedented performance. It took 100 hours to genotype 4.5M variants in the 4.2K samples using one server with 32 cores and 145GB of memory. A similar task would take months or even years using single-sample genotypers. The Great Genotyper opens the door to new ways to study SVs. We demonstrate its application in finding pathogenic variants by calculating accurate allele frequencies for novel SVs. Also, a premade index is used to create a 4K reference panel by genotyping variants from the Human Pangenome Reference Consortium (HPRC). The new reference panel allows for SV imputation from genotyping microarrays. Moreover, we genotype the GWAS catalog and merge its variants with the 4K reference panel. We show 6.2K events of high linkage between the HPRC's SVs and nearby GWAS SNPs, which can help in interpreting the effect of these SVs on gene functions. This analysis uncovers the detailed haplotype structure of the human fibrinogen locus and revives the pathogenic association of a 28 bp insertion in the FGA gene with thromboembolic disorders.

11
Mathematical Model of a Personalized Neoantigen Cancer Vaccine and the Human Immune System: Evaluation of Efficacy

Rodriguez Messan, M.; Yogurtcu, O. N.; McGill, J. R.; Nukala, U.; Sauna, Z. E.; Yang, H.

2021-01-09 pharmacology and therapeutics 10.1101/2021.01.08.21249452
Top 0.1%
138× avg

Cancer vaccines are an important component of the cancer immunotherapy toolkit, enhancing the immune response to malignant cells by activating CD4+ and CD8+ T cells. Multiple successful clinical applications of cancer vaccines have shown good safety and efficacy. Despite the notable progress, significant challenges remain in obtaining consistent immune responses across heterogeneous patient populations, as well as various cancers. We present as a proof of concept a mechanistic mathematical model describing key interactions of a personalized neoantigen cancer vaccine with an individual patient's immune system. Specifically, the model considers the vaccine concentration of tumor-specific antigen peptides and adjuvant, the patient's major histocompatibility complex I and II copy numbers, tumor size, T cells, and antigen presenting cells. We parametrized the model using patient-specific data from a recent clinical study in which individualized cancer vaccines were used to treat six melanoma patients. Model simulations predicted both immune responses to the vaccine, represented by T cell counts, and clinical outcome (determined as change in tumor size). These kinds of models have the potential to lay the foundation for generating in silico clinical trial data and to aid the development and efficacy assessment of personalized cancer vaccines. Author summary: Personalized cancer vaccines have gained attention in recent years due to advances in sequencing techniques that have facilitated the identification of multiple tumor-specific mutations. This type of individualized immunotherapy has the potential to be specific, efficacious, and safe since it induces an immune response to protein targets not found on normal cells. This work focuses on understanding and analyzing important mechanisms involved in the activity of personalized cancer vaccines using a mechanistic mathematical model. This model describes the interactions of a personalized neoantigen peptide cancer vaccine, the human immune system, and tumor cells operating at the molecular and cellular levels. The molecular level captures the processing and presentation of neoantigens by dendritic cells to T cells using cell surface proteins. The cellular level describes the differentiation of dendritic cells due to peptide and adjuvant concentrations in the vaccine, the activation and proliferation of T cells in response to treatment, and tumor growth. The model captures immune response behavior to a vaccine associated with patient-specific factors (e.g., different initial tumor burdens). Our model serves as a proof of concept displaying its utility in clinical outcome prediction, lays a foundation for developing in silico clinical trials, and aids in the efficacy assessment of personalized vaccines.

12
National Consumption of Antimalarial Drugs and COVID-19 Deaths Dynamics : an Ecological Study

Izoulet, M.

2020-04-24 pharmacology and therapeutics 10.1101/2020.04.18.20063875
Top 0.1%
130× avg

COVID-19 (Coronavirus Disease-2019) is an international public health problem with a high rate of severe clinical cases. Several treatments are currently being tested worldwide. This paper focuses on anti-malarial drugs such as chloroquine or hydroxychloroquine. We compare the dynamics of COVID-19 daily deaths in countries using anti-malarial drugs as a treatment from the start of the epidemic versus countries that do not, from the day of the 3rd death through the following 10 days. We then use ARIMA modeling to produce a short-term forecast of death dynamics for each group. We show that the first group has a much slower dynamic in daily deaths than the second group. This ecological study is of course only one additional piece of evidence in the debate regarding the efficiency of anti-malarial drugs, and it is also limited, as the two groups certainly have other systemic differences in the way they responded to the pandemic, in the way they report deaths, or in their populations that better explain differences in dynamics. Nevertheless, the difference in dynamics of daily deaths is so striking that we believe it is useful to present these results as a clue in research on the efficiency of hydroxychloroquine. In the end, these data might ultimately be either a piece of evidence in favor of anti-malarial drugs or a stepping stone toward understanding what other ecological aspects play a role in the dynamics of COVID-19 deaths.

13
Discovery of Dynamic Models for AML Disease Progression from Longitudinal Multi-Modal Clinical Data Using Explainable Machine Learning

Mousavi, R.; Mustafa Ali, M. K.; Lobo, D.

2025-04-09 oncology 10.1101/2025.04.07.25325267
Top 0.1%
126× avg

Acute Myeloid Leukemia (AML) is a complex and heterogeneous disease characterized by severe clinical progression, fast cellular proliferation, and often high mortality rates. Incorporating diverse longitudinal information on patients' medical histories is essential for developing effective disease predictive models applicable to both research and clinical settings. Here, we present a robust methodology for discovering dynamic predictive models to elucidate AML disease progression dynamics from a novel longitudinal multimodal clinical dataset of patients diagnosed with AML. The clinical dataset was analyzed to reveal the main clinical, genetic, and treatment features modulating disease progression. To discover mathematical models (including interactions, parameters, and nodes) predictive of AML progression, we present an explainable machine learning algorithm based on high-performance evolutionary computation. The results demonstrate that the predictive methodology could accurately estimate the clinical dynamics of AML progression in terms of blast percentages for both training and novel patients. This study demonstrates that the developed explainable machine learning approach can successfully predict AML progression by leveraging the heterogeneous and longitudinal dynamics of patients' clinical data. More importantly, this methodology shows significant potential for application in modeling the progression dynamics of other acute diseases, providing a flexible and adaptable framework for advancing clinical and translational research.

14
Simulate Scientific Reasoning with Multiple Large Language Models: An Application to Alzheimer's Disease Combinatorial Therapy

Xu, Q.; Liu, X.; Jiang, X.; Kim, Y.

2024-12-12 pharmacology and therapeutics 10.1101/2024.12.10.24318800
Top 0.1%
120× avg

Motivation: This study aims to develop an AI-driven framework that leverages large language models (LLMs) to simulate scientific reasoning and peer review to predict efficacious combinatorial therapy when data-driven prediction is infeasible. Results: Our proposed framework achieved a significantly higher accuracy (0.74) than traditional knowledge-based prediction (0.52). An ablation study highlighted the importance of high-quality few-shot examples, external knowledge integration, self-consistency, and review within the framework. External validation with private experimental data yielded an accuracy of 0.82, further confirming the framework's ability to generate high-quality hypotheses in biological inference tasks. Our framework offers an automated knowledge-driven hypothesis generation approach when data-driven prediction is not a viable option. Availability and implementation: Our source code and data are available at https://github.com/QidiXu96/Coated-LLM

15
Channel Capacity of Genome-Wide Cell-Free DNA Fragment Length Distribution in Colorectal Cancer

Matov, A.

2024-07-18 oncology 10.1101/2024.07.17.24310568
Top 0.1%
119× avg

Introduction: Each piece of cell-free DNA (cfDNA) has a length determined by the exact metabolic conditions in the cell it belonged to at the time of cell death. Changes in cellular regulation lead to a variety of patterns, based on the different numbers of fragments with lengths up to several hundred base pairs (bp) at each of the almost three billion genomic positions, and these patterns allow for the detection of disease and the precise identification of the tissue of origin. Methods: A Kullback-Leibler (KL) divergence computation identifies the fragment lengths and areas of the human genome, depending on the stage, for which disease samples, starting from pre-clinical disease stages, diverge from healthy individual samples. We provide examples of genes related to colorectal cancer (CRC) that our algorithm detected as belonging to divergent genomic bins. The staging of CRC can be viewed as a Markov chain, which provides a framework for studying disease progression and the types of epigenetic changes occurring longitudinally at each stage, and which might aid the correct classification of a new hospital sample. Results: If such data are treated as grayscale images, pattern recognition using artificial intelligence could be one approach to classification. In CRC, Stage I disease does not, for the most part, shed any tumor material into circulation, making detection difficult for established machine learning (ML) methods. This leads to the deduction that early detection, where we can only rely on changes in the metabolic patterns, can be accomplished when the information is considered in its entirety, for example by applying computer vision methods. Conclusions: Longitudinal analysis of patients' genetic datasets can detect the early stages of neoplasm better than population-based methods.
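
As a minimal sketch of the KL divergence step, assuming each genomic bin is summarized by a histogram of fragment counts per length class, the divergence of a disease sample from a healthy one can be computed as below. The binning, the pseudocount smoothing, the direction D(disease || healthy), and the example counts are illustrative assumptions, not details taken from the preprint.

```python
import math


def normalize(counts, pseudocount=1.0):
    """Turn raw fragment counts into probabilities.

    The pseudocount keeps every class strictly positive so the
    logarithm in kl_divergence is always defined.
    """
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions of equal length.

    Every q[i] must be positive wherever p[i] is positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical fragment counts per length class for one genomic bin.
healthy = normalize([5, 40, 120, 30, 5])
disease = normalize([20, 60, 90, 25, 5])
bin_divergence = kl_divergence(disease, healthy)
```

Bins could then be ranked by `bin_divergence`, with larger values marking regions where the disease sample's length distribution departs most from the healthy one. KL divergence is asymmetric, so swapping the arguments generally gives a different value.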

16
AImmune: a new blood-based machine learning approach to improving immune profiling analysis on COVID-19 patients

Zhang, X. T.; Han, R. H.

2021-12-01 genetic and genomic medicine 10.1101/2021.11.26.21266883
Top 0.1%
119× avg

A massive number of transcriptomic profiles of blood samples from COVID-19 patients have been produced since the COVID-19 pandemic began; however, these big data from primary studies have not been well integrated by machine learning approaches. Taking advantage of modern machine learning algorithms, we collected and integrated single cell RNA-seq (scRNA-seq) data from three independent studies, identified genes potentially useful for interpreting severity, and developed a high-performance deep learning-based deconvolution model, AImmune, that can predict the proportions of seven different immune cell types from bulk RNA-seq results of human peripheral blood mononuclear cells (PBMCs). This novel approach can be used for clinical blood testing of COVID-19, on the grounds that previous research shows that mRNA alterations in blood-derived PBMCs may serve as a severity indicator. Assessed on real-world data sets, the AImmune model outperformed the most recognized immune profiling model, CIBERSORTx. The study showed that results obtained by the true scRNA-seq route can be consistently reproduced by AImmune, indicating its potential to replace the costly scRNA-seq technique in the analysis of circulating blood cells for both clinical and research purposes.

17
Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation

Pugalenthi, P. V.; Xie, L.; He, B.; Nho, K.; Saykin, A. J.; Yan, J.

2023-10-23 genetic and genomic medicine 10.1101/2023.10.23.23297399
Top 0.1%
118× avg

Alzheimer's disease (AD) is a highly heritable brain dementia, accompanied by substantial failure of cognitive function. Large-scale genome-wide association studies (GWAS) have led to a significant set of SNPs associated with AD and related traits. GWAS hits usually emerge as clusters where a lead SNP with the highest significance is surrounded by other, less significant neighboring SNPs. Although functionality is not guaranteed even for the strongest associations in a GWAS, the lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools enable the prediction of function from a segment of DNA sequence with significantly improved precision, which allows in silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits on chromatin functions, and whether that impact is altered by the genomic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. Although some GWAS lead SNPs showed a dominating functional effect regardless of the neighboring SNP alleles, several others showed enhanced loss or gain of function under certain genomic contexts, suggesting potential extra information hidden in the LD blocks.

18
ChatGPT as a bioinformatic partner.

Mondillo, G.; Perrotta, A.; Colosimo, S.; Frattolillo, V.

2024-08-20 health informatics 10.1101/2024.08.20.24312291
Top 0.1%
114× avg

The advanced Large Language Model ChatGPT4o, developed by OpenAI, can be used in the field of bioinformatics to analyze and understand cross-reactive allergic reactions. This study explores the use of ChatGPT4o to support research on allergens, particularly in the cross-reactivity syndrome between cat and pork. Using a hypothetical clinical case of a child with a confirmed allergy to Fel d 2 (cat albumin) and Sus s 1 (pork albumin), the model guided data collection, protein sequence analysis, and three-dimensional structure visualization. Through the use of bioinformatics tools like SDAP 2.0 and BepiPRED, the epitope regions of the allergenic proteins were predicted, confirming their accessibility to immunoglobulin E (IgE) and probability of cross-reactivity. The results show that regions with high epitope probability exhibit high surface accessibility and predominantly coil and helical structures. The construction of a phylogenetic tree further supported the evolutionary relationships among the studied allergens. ChatGPT4o has demonstrated its usefulness in guiding non-specialist researchers through complex bioinformatics processes, making advanced science accessible and improving analytical and innovation capabilities.

19
Establishment of in silico prediction of adjuvant chemotherapy response from active mitotic gene signature in non-small cell lung cancer

Kwon, E.-J.; Hwang, H. S.; Chang, E.; An, J.-Y.; Cha, H.-J.

2025-03-15 pharmacology and therapeutics 10.1101/2025.03.14.25322930
Top 0.1%
114× avg

Conventional chemotherapeutics exploit cancer's hallmark of active cell cycling, primarily targeting mitotic cells. Consequently, the mitotic index (MI), representing the proportion of cells in mitosis, serves as both a prognostic biomarker for cancer progression and a predictive marker for chemo-responsiveness. In this study, we developed a transcriptome signature to predict chemotherapeutic responsiveness based on the Active Mitosis Signature Enrichment Score (AMSES), a computational metric previously established to estimate active mitosis using multi-omics data from The Cancer Genome Atlas (TCGA) lung cancer cohorts, namely lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) patients. Leveraging advanced machine learning techniques, we enhanced the predictive power of AMSES and developed AMSES for chemo-responsiveness, termed A4CR. Comparative analysis revealed a strong correlation between A4CR and the MI of 69 cases from a separate non-small cell lung cancer (NSCLC) cohort. The utility of A4CR as a therapeutic biomarker was validated through in silico analysis of public datasets, encompassing transcriptomic profiles of cancer cell lines (CCLs) and their corresponding multiple drug response data, as well as clinicogenomic data from TCGA. These findings highlight the potential of integrating gene signatures with machine learning and large-scale datasets to advance precision oncology and improve therapeutic decision-making for cancer patients.

20
Decomposing patient heterogeneity of single-cell cancer data by cross-attention neural networks

Subedi, S.; Park, Y. P.

2025-06-06 oncology 10.1101/2025.06.04.25328900
Top 0.1%
113× avg

Gene expression variation in cancer cells is attributed to many inherited and environmental factors, including genetic variants and cellular landscapes. Decomposing these different sources of information is intractable with single-cell RNA-seq alone. However, we show that our new approach can separate them with the help of multiple patients, assuming that cell types are widely shared and that genetic effects are specific to a particular patient. Our approach, based on a cross-attention neural network, was applied to three different cancer types to identify cell types and patient-specific genetic effects in transcriptomic data. Residual expression, after cell-type effects are excluded, can implicate patient-specific disease mechanisms.